Manipulating and Visualizing Data Frames

While you follow this lab, you may want to open these cheat sheets:

Filestructure and Shell Commands

cd Desktop
mkdir lab05
cd lab05
mkdir data
mkdir report
mkdir images
ls
touch README.md
# added brief description of lab 05 with markdown syntax
open README.md
cd data
curl -O https://raw.githubusercontent.com/ucb-stat133/stat133-fall-2018/master/data/nba2018-players.csv
ls
wc nba2018-players.csv
head -n 5 nba2018-players.csv
tail -n 5 nba2018-players.csv

Installing Packages

I include this code for completeness, but it only needs to be run once to install dplyr and ggplot2:

install.packages(c("dplyr", "ggplot2"))

About loading packages: Another rule to keep in mind is to always load any required packages at the very top of your script files (.R or .Rmd or .Rnw files). Avoid calling the library() function in the middle of a script. Instead, load all the packages before anything else.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Path for Images

The other important specification to include in your Rmd file is a global chunk option to specify the location of plots and graphics. This is done by setting the fig.path argument inside the knitr::opts_chunk$set() function.

If you don’t specify fig.path, "knitr" will create a default directory to store all the plots produced when knitting an Rmd file. This time, however, we want more control over where things are placed. Because you already have a folder images/ as part of the file structure, this is where we want "knitr" to save all the generated graphics. Notice the use of a relative path fig.path = '../images/'. This is because your Rmd file should be inside the folder report/, but the folder images/ is outside report/ (i.e. in the same parent directory as report/). I did this part at the beginning of the Rmd file.
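
As a minimal sketch of the setting described above, the setup chunk near the top of the Rmd file could contain a call like the following. It assumes the "knitr" package is installed and that the Rmd lives in report/; any other chunk options you set are up to you.

```r
# Global chunk option: save every generated figure under images/.
# The path is relative to the folder that holds the Rmd file (report/),
# so '../images/' points at the sibling images/ folder.
# Keep the trailing slash: knitr uses fig.path as a prefix for file names.
knitr::opts_chunk$set(fig.path = '../images/')
```

With this prefix, a plot produced by a chunk labeled, say, my-scatterplot (a hypothetical label) would typically be written to ../images/my-scatterplot-1.png.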

NBA Players Data

The data file for this lab is: nba2018-players.csv. To import the data into R you can use the base function read.csv(), or you can also use read_csv() from the package "readr":

library(readr)
setwd("/Users/sharonhui/Desktop/lab05/data")
dat <- read_csv('nba2018-players.csv')
## Parsed with column specification:
## cols(
##   player = col_character(),
##   team = col_character(),
##   position = col_character(),
##   height = col_integer(),
##   weight = col_integer(),
##   age = col_integer(),
##   experience = col_integer(),
##   college = col_character(),
##   salary = col_double(),
##   games = col_integer(),
##   minutes = col_integer(),
##   points = col_integer(),
##   points3 = col_integer(),
##   points2 = col_integer(),
##   points1 = col_integer()
## )
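
For comparison, here is a hedged sketch of the base R route with read.csv(). To keep it runnable anywhere, it writes a tiny stand-in CSV to a temporary file (two rows and five columns copied from the lab data); csv_path and players are illustrative names, and in the lab itself you would pass 'nba2018-players.csv' instead.

```r
# Write a small stand-in CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "player,team,position,height,weight",
  "Al Horford,BOS,C,82,245",
  "Amir Johnson,BOS,PF,81,240"
), csv_path)

# Base R reader: returns a plain data.frame rather than a tibble.
# stringsAsFactors = FALSE keeps text columns as character vectors
# (this is already the default from R 4.0.0 onward).
players <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the column types that were inferred.
str(players)
```

Unlike read_csv(), read.csv() does not print a column specification, so str() is a quick way to check how each column was parsed.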

Basic “dplyr” verbs

To make the learning process of “dplyr” gentler, Hadley Wickham proposes beginning with a set of five basic verbs or operations for data frames (each verb corresponds to a function in “dplyr”):

A slightly modified version of Hadley’s list of verbs:

Filtering, slicing, and selecting

slice() allows you to select rows by position.

filter() allows you to select rows by condition.

select() allows you to select columns by name.

three_rows <- slice(dat, 1:3)
gt_85 <- filter(dat, height > 85)
player_height <- select(dat, player, height)

Your turn:

slice(dat, 1:5)
## # A tibble: 5 x 15
##              player  team position height weight   age experience
##               <chr> <chr>    <chr>  <int>  <int> <int>      <int>
## 1        Al Horford   BOS        C     82    245    30          9
## 2      Amir Johnson   BOS       PF     81    240    29         11
## 3     Avery Bradley   BOS       SG     74    180    26          6
## 4 Demetrius Jackson   BOS       PG     73    201    22          0
## 5      Gerald Green   BOS       SF     79    205    31          9
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## #   minutes <int>, points <int>, points3 <int>, points2 <int>,
## #   points1 <int>
slice(dat, c(10, 15, 20, 25, 30, 35, 40, 50))
## # A tibble: 8 x 15
##             player  team position height weight   age experience
##              <chr> <chr>    <chr>  <int>  <int> <int>      <int>
## 1    Jonas Jerebko   BOS       PF     82    231    29          6
## 2     Tyler Zeller   BOS        C     84    253    27          4
## 3 Derrick Williams   CLE       PF     80    240    25          5
## 4     Jordan McRae   CLE       SG     78    185    25          1
## 5    Larry Sanders   CLE        C     83    235    28          5
## 6      Cory Joseph   TOR       PG     75    193    25          5
## 7     Jakob Poeltl   TOR        C     84    248    21          0
## 8     Bradley Beal   WAS       SG     77    207    23          4
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## #   minutes <int>, points <int>, points3 <int>, points2 <int>,
## #   points1 <int>
slice(dat, ((nrow(dat))-4):(nrow(dat)))
## # A tibble: 5 x 15
##            player  team position height weight   age experience
##             <chr> <chr>    <chr>  <int>  <int> <int>      <int>
## 1 Marquese Chriss   PHO       PF     82    233    19          0
## 2    Ronnie Price   PHO       PG     74    190    33         11
## 3     T.J. Warren   PHO       SF     80    230    23          2
## 4      Tyler Ulis   PHO       PG     70    150    21          0
## 5  Tyson Chandler   PHO        C     85    240    34         15
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## #   minutes <int>, points <int>, points3 <int>, points2 <int>,
## #   points1 <int>
# Another way to do this: tail() works directly on the data frame
tail(dat, 5)
## # A tibble: 5 x 15
##            player  team position height weight   age experience
##             <chr> <chr>    <chr>  <int>  <int> <int>      <int>
## 1 Marquese Chriss   PHO       PF     82    233    19          0
## 2    Ronnie Price   PHO       PG     74    190    33         11
## 3     T.J. Warren   PHO       SF     80    230    23          2
## 4      Tyler Ulis   PHO       PG     70    150    21          0
## 5  Tyson Chandler   PHO        C     85    240    34         15
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## #   minutes <int>, points <int>, points3 <int>, points2 <int>,
## #   points1 <int>
filter(dat, height < 70)
## # A tibble: 2 x 15
##          player  team position height weight   age experience
##           <chr> <chr>    <chr>  <int>  <int> <int>      <int>
## 1 Isaiah Thomas   BOS       PG     69    185    27          5
## 2    Kay Felder   CLE       PG     69    176    21          0
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## #   minutes <int>, points <int>, points3 <int>, points2 <int>,
## #   points1 <int>
filter(dat, team == "GSW")
## # A tibble: 16 x 15
##                  player  team position height weight   age experience
##                   <chr> <chr>    <chr>  <int>  <int> <int>      <int>
##  1     Anderson Varejao   GSW        C     82    273    34         12
##  2       Andre Iguodala   GSW       SF     78    215    33         12
##  3         Damian Jones   GSW        C     84    245    21          0
##  4           David West   GSW        C     81    250    36         13
##  5       Draymond Green   GSW       PF     79    230    26          4
##  6            Ian Clark   GSW       SG     75    175    25          3
##  7 James Michael McAdoo   GSW       PF     81    230    24          2
##  8         JaVale McGee   GSW        C     84    270    29          8
##  9         Kevin Durant   GSW       PF     81    240    28          9
## 10         Kevon Looney   GSW        C     81    220    20          1
## 11        Klay Thompson   GSW       SG     79    215    26          5
## 12          Matt Barnes   GSW       SF     79    226    36         13
## 13        Patrick McCaw   GSW       SG     79    185    21          0
## 14     Shaun Livingston   GSW       PG     79    192    31         11
## 15        Stephen Curry   GSW       PG     75    190    28          7
## 16        Zaza Pachulia   GSW        C     83    270    32         13
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## #   minutes <int>, points <int>, points3 <int>, points2 <int>,
## #   points1 <int>
filter(dat, team == "GSW" & position == "C")
## # A tibble: 6 x 15
##             player  team position height weight   age experience
##              <chr> <chr>    <chr>  <int>  <int> <int>      <int>
## 1 Anderson Varejao   GSW        C     82    273    34         12
## 2     Damian Jones   GSW        C     84    245    21          0
## 3       David West   GSW        C     81    250    36         13
## 4     JaVale McGee   GSW        C     84    270    29          8
## 5     Kevon Looney   GSW        C     81    220    20          1
## 6    Zaza Pachulia   GSW        C     83    270    32         13
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## #   minutes <int>, points <int>, points3 <int>, points2 <int>,
## #   points1 <int>
dat %>%
  filter(team == "LAL") %>%
  select(player)
## # A tibble: 14 x 1
##               player
##                <chr>
##  1    Brandon Ingram
##  2      Corey Brewer
##  3  D'Angelo Russell
##  4       David Nwaba
##  5       Ivica Zubac
##  6   Jordan Clarkson
##  7     Julius Randle
##  8         Luol Deng
##  9 Metta World Peace
## 10        Nick Young
## 11       Tarik Black
## 12   Thomas Robinson
## 13    Timofey Mozgov
## 14       Tyler Ennis
dat %>% 
filter(team == "GSW" & position == "PG") %>%
select(player, salary)
## # A tibble: 2 x 2
##             player   salary
##              <chr>    <dbl>
## 1 Shaun Livingston  5782450
## 2    Stephen Curry 12112359
dat %>% 
filter(experience > 10 & salary <= 10000000) %>%
select(player, age, team)
## # A tibble: 36 x 3
##               player   age  team
##                <chr> <int> <chr>
##  1      Andrew Bogut    32   CLE
##  2     Dahntay Jones    36   CLE
##  3    Deron Williams    32   CLE
##  4       James Jones    36   CLE
##  5       Kyle Korver    35   CLE
##  6 Richard Jefferson    36   CLE
##  7     Jose Calderon    35   ATL
##  8    Kris Humphries    31   ATL
##  9     Mike Dunleavy    36   ATL
## 10       Jason Terry    39   MIL
## # ... with 26 more rows
# experience and age are integer columns, so compare against numbers, not strings
dat %>%
  filter(experience == 0 & age == 20) %>%
  select(player, team, height, weight) %>%
  head(5)
## # A tibble: 5 x 4
##              player  team height weight
##               <chr> <chr>  <int>  <int>
## 1      Jaylen Brown   BOS     79    225
## 2    Henry Ellenson   DET     83    245
## 3 Stephen Zimmerman   ORL     84    240
## 4   Dejounte Murray   SAS     77    170
## 5    Chinanu Onuaku   HOU     82    245

Adding new variables: mutate()

Another basic verb is mutate(), which allows you to add new variables. Let’s create a small data frame for the Warriors with three columns: player, height, and weight:

# creating a small data frame step by step
gsw <- filter(dat, team == 'GSW')
gsw <- select(gsw, player, height, weight)
gsw <- slice(gsw, c(4, 8, 10, 14, 15))
gsw
## # A tibble: 5 x 3
##             player height weight
##              <chr>  <int>  <int>
## 1       David West     81    250
## 2     JaVale McGee     84    270
## 3     Kevon Looney     81    220
## 4 Shaun Livingston     79    192
## 5    Stephen Curry     75    190

Now, let’s use mutate() to (temporarily) add a column with the ratio height / weight:

mutate(gsw, height / weight)
## # A tibble: 5 x 4
##             player height weight `height/weight`
##              <chr>  <int>  <int>           <dbl>
## 1       David West     81    250       0.3240000
## 2     JaVale McGee     84    270       0.3111111
## 3     Kevon Looney     81    220       0.3681818
## 4 Shaun Livingston     79    192       0.4114583
## 5    Stephen Curry     75    190       0.3947368

Create a new name like ht_wt = height / weight:

mutate(gsw, ht_wt = height / weight)
## # A tibble: 5 x 4
##             player height weight     ht_wt
##              <chr>  <int>  <int>     <dbl>
## 1       David West     81    250 0.3240000
## 2     JaVale McGee     84    270 0.3111111
## 3     Kevon Looney     81    220 0.3681818
## 4 Shaun Livingston     79    192 0.4114583
## 5    Stephen Curry     75    190 0.3947368

In order to permanently change the data, you need to assign the changes to an object:

gsw2 <- mutate(gsw, ht_m = height * 0.0254, wt_kg = weight * 0.4536) 
gsw2
## # A tibble: 5 x 5
##             player height weight   ht_m    wt_kg
##              <chr>  <int>  <int>  <dbl>    <dbl>
## 1       David West     81    250 2.0574 113.4000
## 2     JaVale McGee     84    270 2.1336 122.4720
## 3     Kevon Looney     81    220 2.0574  99.7920
## 4 Shaun Livingston     79    192 2.0066  87.0912
## 5    Stephen Curry     75    190 1.9050  86.1840

Reordering rows: arrange()

The next basic verb of “dplyr” is arrange(), which allows you to reorder rows. For example, here’s how to arrange the rows of gsw by height:

arrange(gsw, height)
## # A tibble: 5 x 3
##             player height weight
##              <chr>  <int>  <int>
## 1    Stephen Curry     75    190
## 2 Shaun Livingston     79    192
## 3       David West     81    250
## 4     Kevon Looney     81    220
## 5     JaVale McGee     84    270

By default arrange() sorts rows in increasing order. To arrange rows in descending order you need to use the auxiliary function desc().

arrange(gsw, desc(height))
## # A tibble: 5 x 3
##             player height weight
##              <chr>  <int>  <int>
## 1     JaVale McGee     84    270
## 2       David West     81    250
## 3     Kevon Looney     81    220
## 4 Shaun Livingston     79    192
## 5    Stephen Curry     75    190
# order rows by height, and then weight
arrange(gsw, height, weight)
## # A tibble: 5 x 3
##             player height weight
##              <chr>  <int>  <int>
## 1    Stephen Curry     75    190
## 2 Shaun Livingston     79    192
## 3     Kevon Looney     81    220
## 4       David West     81    250
## 5     JaVale McGee     84    270

Your Turn
  • using the data frame gsw, add a new variable product with the product of height and weight.
mutate(gsw, product = height * weight)
## # A tibble: 5 x 4
##             player height weight product
##              <chr>  <int>  <int>   <int>
## 1       David West     81    250   20250
## 2     JaVale McGee     84    270   22680
## 3     Kevon Looney     81    220   17820
## 4 Shaun Livingston     79    192   15168
## 5    Stephen Curry     75    190   14250
  • create a new data frame gsw3, by adding columns log_height and log_weight with the log transformations of height and weight.
gsw3 <- mutate(gsw, log_height = log(height), log_weight = log(weight))
  • use the original data frame to filter() and arrange() those players with height less than 71 inches tall, in increasing order.
new_dat <- filter(dat, height < 71)
arrange(new_dat, height)
## # A tibble: 4 x 15
##           player  team position height weight   age experience
##            <chr> <chr>    <chr>  <int>  <int> <int>      <int>
## 1  Isaiah Thomas   BOS       PG     69    185    27          5
## 2     Kay Felder   CLE       PG     69    176    21          0
## 3 Pierre Jackson   DAL       PG     70    180    25          0
## 4     Tyler Ulis   PHO       PG     70    150    21          0
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## #   minutes <int>, points <int>, points3 <int>, points2 <int>,
## #   points1 <int>
  • display the name, team, and salary, of the top-5 highest paid players
head(select(arrange(dat, desc(salary)), player, team, salary), 5)
## # A tibble: 5 x 3
##          player  team   salary
##           <chr> <chr>    <dbl>
## 1  LeBron James   CLE 30963450
## 2    Al Horford   BOS 26540100
## 3 DeMar DeRozan   TOR 26540100
## 4  Kevin Durant   GSW 26540100
## 5  James Harden   HOU 26540100
  • display the name, team, and points3, of the top 10 three-point players
head(select(arrange(dat, desc(points3)), player, team, points3), 10)
## # A tibble: 10 x 3
##            player  team points3
##             <chr> <chr>   <int>
##  1  Stephen Curry   GSW     324
##  2  Klay Thompson   GSW     268
##  3   James Harden   HOU     262
##  4    Eric Gordon   HOU     246
##  5  Isaiah Thomas   BOS     245
##  6   Kemba Walker   CHO     240
##  7   Bradley Beal   WAS     223
##  8 Damian Lillard   POR     214
##  9  Ryan Anderson   HOU     204
## 10    J.J. Redick   LAC     201
# Another way
slice(select(arrange(dat, desc(points3)), player, team, points3), 1:10)
## # A tibble: 10 x 3
##            player  team points3
##             <chr> <chr>   <int>
##  1  Stephen Curry   GSW     324
##  2  Klay Thompson   GSW     268
##  3   James Harden   HOU     262
##  4    Eric Gordon   HOU     246
##  5  Isaiah Thomas   BOS     245
##  6   Kemba Walker   CHO     240
##  7   Bradley Beal   WAS     223
##  8 Damian Lillard   POR     214
##  9  Ryan Anderson   HOU     204
## 10    J.J. Redick   LAC     201
  • create a data frame gsw_mpg of GSW players, that contains variables for player name, experience, and min_per_game (minutes per game), sorted by min_per_game (in descending order)
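# assign the result so that gsw_mpg actually exists, then print it
gsw_mpg <- dat %>%
  filter(team == "GSW") %>%
  mutate(min_per_game = minutes / games) %>%
  select(player, experience, min_per_game) %>%
  arrange(desc(min_per_game)) %>%
  data.frame()
gsw_mpg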
##                  player experience min_per_game
## 1         Klay Thompson          5    33.961538
## 2         Stephen Curry          7    33.392405
## 3          Kevin Durant          9    33.387097
## 4        Draymond Green          4    32.513158
## 5        Andre Iguodala         12    26.289474
## 6           Matt Barnes         13    20.500000
## 7         Zaza Pachulia         13    18.114286
## 8      Shaun Livingston         11    17.697368
## 9         Patrick McCaw          0    15.126761
## 10            Ian Clark          3    14.766234
## 11           David West         13    12.558824
## 12         JaVale McGee          8     9.597403
## 13 James Michael McAdoo          2     8.788462
## 14         Damian Jones          0     8.500000
## 15         Kevon Looney          1     8.433962
## 16     Anderson Varejao         12     6.571429

Summarizing values with summarise()

The next verb is summarise(). Conceptually, this involves applying a function on one or more columns, in order to summarize values. This is probably easier to understand with one example.

Say you are interested in calculating the average salary of all NBA players. To do this “a la dplyr” you use summarise(), or its synonym function summarize():

# average salary of NBA players
summarise(dat, avg_salary = mean(salary))
## # A tibble: 1 x 1
##   avg_salary
##        <dbl>
## 1    5804697

Calculating an average like this seems a bit verbose, especially when you can directly use mean() like this:

mean(dat$salary)
## [1] 5804697

What if you want to calculate some summary statistics for salary: min, median, mean, and max?

# some stats for salary (dplyr)
summarise(
  dat, 
  min = min(salary),
  median = median(salary),
  avg = mean(salary),
  max = max(salary)
)
## # A tibble: 1 x 4
##     min median     avg      max
##   <dbl>  <dbl>   <dbl>    <dbl>
## 1  5145  3e+06 5804697 30963450

This may still not look like much. You can do the same in base R (and there are actually better ways to do it):

# some stats for salary (base R) 

c(min = min(dat$salary), median = median(dat$salary), avg = mean(dat$salary), max = max(dat$salary))
##      min   median      avg      max 
##     5145  3000000  5804697 30963450
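
One of the "better ways" in base R is probably summary(), which reports several of these statistics in a single call. The sketch below uses the built-in mtcars data as a stand-in so that it runs without the lab's CSV; with the lab data the call would presumably be summary(dat$salary).

```r
# mtcars ships with R; its mpg column stands in for the salary column here.
x <- mtcars$mpg

# summary() returns the minimum, quartiles, median, mean, and maximum at once.
stats <- summary(x)
stats

# quantile() gives the order statistics for any probabilities you choose
# (0 = minimum, 0.5 = median, 1 = maximum).
quantile(x, probs = c(0, 0.5, 1))
```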

Grouped operations

To actually appreciate the power of summarise(), we need to introduce the other major basic verb in “dplyr”: group_by(). This is the function that allows you to perform data aggregations, or grouped operations.

Let’s see the combination of summarise() and group_by() to calculate the average salary by team:

# average salary, grouped by team 
summarise( group_by(dat, team), avg_salary = mean(salary) )
## # A tibble: 30 x 2
##     team avg_salary
##    <chr>      <dbl>
##  1   ATL    5494447
##  2   BOS    6127673
##  3   BRK    4011351
##  4   CHI    5781368
##  5   CHO    5531548
##  6   CLE    7069699
##  7   DAL    5157128
##  8   DEN    4648719
##  9   DET    6871632
## 10   GSW    6265160
## # ... with 20 more rows
# average salary, grouped by position
summarise(
  group_by(dat, position),
  avg_salary = mean(salary)
)
## # A tibble: 5 x 2
##   position avg_salary
##      <chr>      <dbl>
## 1        C    6529906
## 2       PF    5801127
## 3       PG    5601217
## 4       SF    6042455
## 5       SG    5114178
# average weight and height, by position, displayed in descending order by average height
arrange(
  summarise(
    group_by(dat, position),
    avg_height = mean(height),
    avg_weight = mean(weight)),
  desc(avg_height)
)
## # A tibble: 5 x 3
##   position avg_height avg_weight
##      <chr>      <dbl>      <dbl>
## 1        C   83.21649   251.1031
## 2       PF   81.40816   235.2857
## 3       SF   79.52381   220.2381
## 4       SG   77.04902   204.3431
## 5       PG   74.32292   188.9583
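
Nested calls like the last one have to be read from the inside out. As a hedged sketch, the same kind of grouped summary can be written with the pipe operator %>% (already used earlier in this lab), which reads top to bottom. The built-in mtcars data and its columns cyl and mpg are stand-ins for dat, position, and height.

```r
library(dplyr)

# Nested form, in the style used above: read from the inside out.
nested <- arrange(
  summarise(group_by(mtcars, cyl), avg_mpg = mean(mpg)),
  desc(avg_mpg)
)

# Piped form: the same three steps, read top to bottom.
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  arrange(desc(avg_mpg))

# Both forms produce the same table: one row per group, sorted by the average.
piped
```

Which form to use is a matter of taste; the pipe tends to pay off once more than two verbs are chained.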

Your turn:

use summarise() to get the largest height value.

summarise(dat, largest_height_value = max(height))
## # A tibble: 1 x 1
##   largest_height_value
##                  <dbl>
## 1                   87

use summarise() to get the standard deviation of points3.

summarise(dat, standard_deviation_of_points3 = sd(points3))
## # A tibble: 1 x 1
##   standard_deviation_of_points3
##                           <dbl>
## 1                      55.11807

use summarise() and group_by() to display the median of three-points, by team.

summarise(
  group_by(dat, team),
  median_points3 = median(points3)
)
## # A tibble: 30 x 2
##     team median_points3
##    <chr>          <dbl>
##  1   ATL           32.0
##  2   BOS           46.0
##  3   BRK           36.0
##  4   CHI           28.5
##  5   CHO           13.0
##  6   CLE           26.5
##  7   DAL           18.0
##  8   DEN           46.0
##  9   DET           28.0
## 10   GSW           10.5
## # ... with 20 more rows

display the average of points3 by team, in ascending order, for the bottom-5 teams (worst three-point teams).

tail(arrange((summarise(
   group_by(dat, team),
    average_points3 = mean(points3)
)), desc(average_points3)), 5)
## # A tibble: 5 x 2
##    team average_points3
##   <chr>           <dbl>
## 1   CHI        35.31250
## 2   SAC        35.12500
## 3   ORL        34.33333
## 4   PHO        33.47059
## 5   NOP        32.43750
arrange((tail(arrange((summarise(
   group_by(dat, team),
    average_points3 = mean(points3)
)), desc(average_points3)), 5)), (average_points3))
## # A tibble: 5 x 2
##    team average_points3
##   <chr>           <dbl>
## 1   NOP        32.43750
## 2   PHO        33.47059
## 3   ORL        34.33333
## 4   SAC        35.12500
## 5   CHI        35.31250

obtain the mean and standard deviation of age, for Power Forwards with between 5 and 10 years (inclusive) of experience.

summarise(select(filter(dat, position == "PF", experience >= 5 & experience <= 10), age), mean_power_forwards = mean(age), sd_power_forwards = sd(age))
## # A tibble: 1 x 2
##   mean_power_forwards sd_power_forwards
##                 <dbl>             <dbl>
## 1            28.43243          2.267408

First contact with ggplot()

Scatterplots
Label your chunks!

When including code for plots and graphics, we strongly recommend that you create an individual code chunk for each plot, and that you give a label to that chunk.

# scatterplot (option 1)
ggplot(data = dat) +
  geom_point(aes(x = points, y = salary))

  • ggplot() creates an object of class “ggplot”

  • the main input for ggplot() is data which must be a data frame

  • then we use the "+" operator to add a layer

  • the geometric objects (geoms) are points: geom_point()

  • aes() is used to specify the x and y coordinates, by taking columns points and salary from the data frame

# scatterplot (option 2)
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point()

Adding color
# Say you want to color code the points in terms of position
# colored scatterplot 
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point(aes(color = position))

# Maybe you want to modify the size of the dots in terms of points3:
# sized and colored scatterplot 
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point(aes(color = position, size = points3))

# To add some transparency effect to the dots, you can use the alpha parameter.
# sized and colored scatterplot 
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point(aes(color = position, size = points3), alpha = 0.7)

Notice that alpha was specified outside aes(). This is because we are not using any column for the alpha transparency values.
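
The distinction between mapping and setting an aesthetic can be sketched as follows. The built-in mtcars data and its columns wt, mpg, and hp are stand-ins for the NBA data, and p_mapped and p_set are illustrative names.

```r
library(ggplot2)

# Mapping: transparency varies with a column of the data,
# so alpha goes inside aes() and ggplot2 adds a legend for it.
p_mapped <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(alpha = hp))

# Setting: one constant transparency for every point,
# so alpha stays outside aes(), as in the lab's example.
p_set <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.7)
```

Printing p_mapped or p_set draws the corresponding plot. Writing aes(alpha = 0.7) by mistake would treat 0.7 as data and add a meaningless legend.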

Your Turn
  • Open the ggplot2 cheatsheet

  • Use the data frame gsw to make a scatterplot of height and weight.

ggplot(data = gsw, aes(x = height, y = weight)) + geom_point()

Find out how to make another scatterplot of height and weight, using geom_text() to display the names of the players.

ggplot(data = gsw, aes(x = height, y = weight)) + 
  geom_text(aes(label = player))

Get a scatter plot of height and weight, for ALL the warriors, displaying their names with geom_label().

ggplot(data = filter(dat, team == "GSW"), aes(x = height, y = weight)) + 
  geom_point() + 
  geom_label(aes(label = player))

Get a density plot of salary (for all NBA players).

ggplot(data = dat, aes(x = salary)) + geom_density()

Get a histogram of points2 with binwidth of 50 (for all NBA players).

ggplot(data = dat, aes(x = points2)) +
  geom_histogram(binwidth = 50)

Get a barchart of the position frequencies (for all NBA players).

ggplot(data = dat, aes( x= position)) +
  geom_bar()

Make a scatterplot of experience and salary of all Centers, and use geom_smooth() to add a regression line.

ggplot(data = filter(dat, position == "C"), aes(x = experience, y = salary)) + geom_point(size = .2) + geom_smooth(method = "lm")

Repeat the same scatterplot of experience and salary of all Centers, but now use geom_smooth() to add a loess line (i.e. smooth line).

ggplot(data = filter(dat, position == "C"), aes(x = experience, y = salary)) + geom_point(size = .2) + geom_smooth(method = "loess")

Faceting

One of the most attractive features of “ggplot2” is the ability to display multiple facets. The idea of facets is to divide a plot into subplots based on the values of one or more categorical (or discrete) variables.

Here’s an example. What if you want to get scatterplots of points and salary separated (or grouped) by position? This is where faceting comes handy, and you can use facet_wrap() for this purpose:

# scatterplot by position
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point() +
  facet_wrap(~ position)

The other faceting function is facet_grid(), which allows you to control the layout of the facets (by rows, by columns, etc)

# scatterplot by position
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point(aes(color = position), alpha = 0.7) +
  facet_grid(~ position) +
  geom_smooth(method = loess)

# scatterplot by position
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point(aes(color = position), alpha = 0.7) +
  facet_grid(position ~ .) +
  geom_smooth(method = loess)

Your turn:
  • Make scatterplots of experience and salary faceting by position.
# scatterplot by position
ggplot(data = dat, aes(x = experience, y = salary)) +
  geom_point(aes(color = position), alpha = 0.7) +
  facet_grid(~ position) +
  geom_smooth(method = loess)

  • Make scatterplots of experience and salary faceting by team
# scatterplot by team
ggplot(data = dat, aes(x = experience, y = salary)) +
  geom_point(aes(color = team), alpha = 0.7) +
  facet_wrap(team ~ .) +
  geom_smooth(method = loess)

  • Make density plots of age faceting by team
ggplot(data = dat, aes(x = age)) + geom_density() + facet_wrap(team ~ .)

  • Make scatterplots of height and weight faceting by position
ggplot(data = dat, aes(x = height, y = weight)) + geom_point(size = .5) + facet_wrap(position ~ .)

  • Make scatterplots of height and weight, with a 2-dimensional density, geom_density2d(), faceting by position
options(warn=-1)
# scatterplot by position
ggplot(data = dat, aes(x = height, y = weight)) +
  geom_point(aes(color = position), alpha = 0.7) +
  facet_wrap(position ~ .) +
  geom_smooth(method = loess) + geom_density2d()

  • Make a scatterplot of experience and salary for the Warriors, but this time add a layer with theme_bw() to get a simpler background
ggplot(data = filter(dat, team == "GSW"), aes(x = experience, y = salary)) +
  geom_point(size = .5) + theme_bw()

  • Repeat any of the previous plots but now adding a layer with another theme, e.g. theme_minimal(), theme_dark(), theme_classic()
ggplot(data = filter(dat, team == "GSW"), aes(x = experience, y = salary)) +
  geom_point(size = .5) + theme_classic()

More shell commands
  • Open the terminal.

  • Move inside the images/ directory of the lab.

  • List the contents of this directory.

  • Now list the contents of the directory in long format.

  • How would you list the contents in long format, by time?

  • How would you list the contents displaying the results in reverse (alphabetical) order?

  • Without changing your current directory, create a directory copies at the parent level (i.e. lab05/).

  • Copy one of the PNG files to the copies folder.

  • Use the wildcard * to copy all the .png files into the directory copies.

  • Change to the directory copies. Use the command mv to rename some of your PNG files.

  • Change to the report/ directory.

  • From within report/, find out how to rename the directory copies as copy-files.

  • From within report/, delete one or two PNG files in copy-files.

  • From within report/, find out how to delete the directory copy-files.

cd Desktop
cd lab05
cd images/
ls
ls -l
ls -l -t
ls -r -l
mkdir ../copies
cp scatterplotwithfacetgrid2-1.png ../copies
cp *.png ../copies
cd ..
cd copies
mv scatterplotwithfacetgrid2-1.png scatterplotwithfacetgrid2.png
mv scatterplotwithfacetgrid1-1.png scatterplotwithafacetgridaboutposition.png
mv scatterplotwithgeom_label-1.png scatterplotgeom_labelheightweight.png
mv scatterplotofexperiencesalaryposition-1.png scatterplotexpsalpos.png
cd ..
cd report/
mv ../copies ../copy-files
rm ../copy-files/repeatplotwiththeme_classic-1.png
rm ../copy-files/densityplotofageteam-1.png
rm -R ../copy-files